refactor: building within_intron and spliced subsets #134
base: main
Hi @IanSudberry, I've not sussed out why the tests fail yet but have been looking closely at some of the code and logic and have some In this PR I've noted...

    # Why add an empty list and append a tuple?
    # Should we not be getting the keys of within_intron to check transcript_id is not in?
    if transcript_id not in within_intron:
        within_intron[transcript_id] = []
    within_intron[transcript_id].append((chromosome, start, end, strand))

...and thinking it through I realised that the reason this might be so is if there is more than one

    first = (1, 2, 3, 4)
    second = ("a", "b", "c", "d")
    test = {}
    test["first"] = []
    test["first"].append(first)
    test["first"].append(second)
    test["first"]
    [(1, 2, 3, 4), ('a', 'b', 'c', 'd')]

Reading through Line 239 -
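The pattern discussed above can also be written with `collections.defaultdict`, which creates the empty list the first time a key is used, so repeated appends under the same `transcript_id` accumulate into a list of tuples. Note that `transcript_id not in within_intron` already tests against the dictionary's keys, so no explicit `.keys()` call is needed. This is a minimal sketch and not the code in this PR; the `add_interval` helper and the example values are hypothetical.

```python
from collections import defaultdict


def add_interval(within_intron, transcript_id, chromosome, start, end, strand):
    """Append one (chromosome, start, end, strand) tuple under transcript_id.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    # defaultdict(list) supplies [] on first access, replacing the
    # explicit "if transcript_id not in within_intron" check.
    within_intron[transcript_id].append((chromosome, start, end, strand))
    return within_intron


within_intron = defaultdict(list)
add_interval(within_intron, "tx1", "chr1", 100, 200, "+")
add_interval(within_intron, "tx1", "chr1", 300, 400, "+")
add_interval(within_intron, "tx2", "chr1", 100, 200, "+")
```

With a plain `dict`, `within_intron.setdefault(transcript_id, []).append(...)` gives the same behaviour in one line.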
Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns (or span introns) and within spliced_3ui, and abstracts it out into functions with tests. Currently the tests fail and I don't think they should, so I need to work out why and perhaps add some additional fixtures to use here.
743b339 to ce1c18d
Okay, this is all a bit confusing, but... The XT tag on a read gives the gene_id of the gene to which the read is mapped. A gene can have multiple transcripts. utrons1/2 contain a list of all the introns associated with the gene in gene_id. If two transcripts contain the same intron, it will appear in those lists twice. Consider the following situation:
Note that the introns list (which could be utrons1 or utrons2, but is likely to be the same for both) contains four introns. The loop goes through these lists and, for each intron, tests whether the current read supports either the splicing or the retention of that intron. In this case, the final dictionaries will look like:
Note that, in theory, a read could support the retention (or splicing) of multiple introns within one transcript (although I think it's probably unlikely), so you could get

In the end, we want data per intron, not per transcript, so where two transcripts contain the same intron, they are eventually collapsed onto the same record (i.e. here we have both introns[1] and introns[3], but we really only need that interval once).
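The per-transcript bookkeeping and the later collapse to one record per intron might be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the record layout `(transcript_id, chromosome, start, end, strand)`, the closed-interval overlap test, the coordinates, and the function names `read_within_intron`, `build_within_intron` and `collapse_to_introns` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical intron records for one gene with two transcripts.
# introns[1] and introns[3] describe the same interval.
introns = [
    ("tx1", "chr1", 100, 200, "+"),
    ("tx1", "chr1", 300, 400, "+"),
    ("tx2", "chr1", 150, 200, "+"),
    ("tx2", "chr1", 300, 400, "+"),
]


def read_within_intron(read_start, read_end, intron_start, intron_end):
    """Assumed retention test: the read starts or ends inside the intron."""
    return (intron_start <= read_start <= intron_end
            or intron_start <= read_end <= intron_end)


def build_within_intron(read_start, read_end, introns):
    """Group the introns a read falls within by transcript_id."""
    within_intron = defaultdict(list)
    for transcript_id, chromosome, start, end, strand in introns:
        if read_within_intron(read_start, read_end, start, end):
            within_intron[transcript_id].append((chromosome, start, end, strand))
    return within_intron


def collapse_to_introns(per_transcript):
    """Collapse per-transcript lists to one record per unique interval."""
    unique = set()
    for intervals in per_transcript.values():
        unique.update(intervals)
    return sorted(unique)


per_transcript = build_within_intron(320, 350, introns)
per_intron = collapse_to_introns(per_transcript)
```

Here the read is recorded once under each transcript, and the collapse step keeps the shared interval only once.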
Great, thanks for that explanation @IanSudbery, really useful. I will see about getting this into the documentation (noted in #136). I've worked out some transcripts that are more useful for the tests in this Pull Request and will be updating the Pull Request in a bit (just finishing them off).
Identified instances where meaningful results are obtained when filtering for cases where the start/end are within introns or there is splicing. Tests are not fully comprehensive yet as not all logic statements are covered, in particular when a transcript spans a whole region, but I'm currently a bit hazy on whether this will always happen as the `query_length` is always `150` (`read_length` is longer and was the cause of previous errors).
5f9fed7 to 552d5f3